Standard Error of an Equating by Item Response Theory*

Abstract

A formula is derived for the asymptotic standard error of a true-score
equating by item response theory. The equating method is applicable when
the two tests to be equated are administered to different groups along with
an "anchor" test. Numerical standard errors are shown for an actual
equating 1) comparing the standard errors of IRT, linear, and equipercentile
methods; 2) illustrating the effect of the length of the anchor test on the
standard error of the equating.

In item response theory (IRT), an examinee's expected number-right
score $\xi$ on test X is equal to the test characteristic function
evaluated at the examinee's ability level $\theta$:

$$\xi = \sum_{i=1}^{n} P_i(\theta) \tag{1'}$$

where $P_i(\theta)$ is the item response function, the probability of a
correct answer to item $i$ at ability level $\theta$. If we have a second test,
Y, measuring the same ability as X, the expected number-right score
$\eta$ on this test may be written as

$$\eta = \sum_{h} P_h(\theta) \tag{4'}$$

Equations (1') and (4') are parametric equations for the functional
relationship between $\xi$ and $\eta$. Note that this relationship is an
exact mathematical one, not a statistical association. Given any $\theta$,
(1') and (4') determine a pair of values, $\xi$ and $\eta$, that represent
the same ability level as $\theta$. Pairs of values $(\xi, \eta)$ determined in
this way are equated. In practice, it is often assumed that the
functional relationship of $\eta$ to $\xi$ given by (1') and (4') can also
be applied to actual number-right scores on the two tests, producing
an equating of these scores.



*This work was supported in part by contract N00014-80-C-0402,
project designation NR 150-453 between the Office of Naval Research and
Educational Testing Service. Reproduction in whole or in part is permitted
for any purpose of the United States Government.



Here, we simply deal with the sampling errors in estimating the
equating relationship of $\eta$ to $\xi$. In (1') and (4'), estimated
item parameters must be used. These are the source of the sampling
errors in IRT equating. Note that the ability estimates for individual
examinees are not used in (1') and (4') and thus will not appear in
our formulas. Until now, the sampling errors of IRT equatings have
never been estimated.

Data 

In IRT equating, we frequently have a set of common items that are
administered to all examinees. These are needed in order to get
Test Y item parameters on the same scale as Test X item parameters.
If the common items are external to Tests X and Y, as assumed here,
the common items are called the anchor test, or, in the present report,
Test W. The sampling variance formulas to be obtained here can be
modified in obvious ways for the case where some or all of the common
items are internal to the tests that are being equated.

Designate the examinees who took both Tests X and W as
Group 1; designate the examinees who took Tests Y and W as Group 2.
Typically, every examinee falls in one of these two groups.

In practice, when there is a series of test forms A, B, ..., X, Y, Z, ...
(say), the Group 1 data on Test X are processed as soon as they
become available in order to equate Test X to the preceding form.
When the Group 2 data become available at some later date, it is
often considered uneconomical to rerun the Group 1 data, so Group 2 is
run by itself. This case, where item parameters for Groups 1 and 2
are estimated separately, is the case to be considered here. (The
simplifying assumption that is used below to approximate the sampling
variances of the estimated item parameters is not available in the
alternative case where Groups 1 and 2 are pooled and all parameters
estimated simultaneously.)

New Equating Formulas 

When parameters are estimated separately for Groups 1 and 2,
the item parameters and $\theta$ in (4') have a different origin and scale
from the item parameters and $\theta$ in (1'). It is thus no longer
possible simply to eliminate $\theta$ from (1') and (4') to obtain the
relation of $\eta$ to $\xi$. The customary procedure in this situation is
to use the anchor test to transform the Group 2 item parameters onto the
scale of the Group 1 item parameters. This procedure adds to the sampling
variance of the transformed item parameters and greatly complicates any
determination of the sampling variance of the subsequent equating. The
procedures and formulas given below avoid this problem since they avoid
transformation of item parameters.




Equations (1') and (4') remain unchanged except that additional
subscripts (explained below) are used. In particular, the symbols
$\theta_1$ and $\theta_2$ must be distinguished because Groups 1 and 2 use different
ability scales:

$$\xi = \sum_{g} P_{g1}(\theta_1) \tag{1}$$

$$\eta = \sum_{g} P_{g4}(\theta_2) \tag{4}$$

The item response functions here are written $P_{gp}$, where $p = 1, 2, 3, 4$
refers to (Test X, Group 1), (Test W, Group 1), (Test W, Group 2),
and (Test Y, Group 2) respectively, and $g = 1, 2, \ldots, n_p$, where $n_p$
is the number of items in the appropriate test.

Let us write down similar equations for the expected number-right
score $\omega$ on anchor test W:

$$\omega = \sum_{g} P_{g2}(\theta_1) \tag{2}$$

$$\omega = \sum_{g} P_{g3}(\theta_2) \tag{3}$$

The equation numbering keeps the tests in convenient order. The desired
equating relation between $\eta$ and $\xi$ can be obtained by eliminating
$\theta_1$, $\theta_2$, and $\omega$ from these four equations.

Computer programs are available for equating $\eta$ to $\xi$ by
eliminating $\theta$ from (1') and (4'). These same programs can be used
to equate $\omega$ to $\xi$ in one step, using (1) and (2), then to equate
$\eta$ to $\omega$ in a second step using (3) and (4). This produces an equating
of $\eta$ to $\xi$ for the presently relevant situation where Group 1 and
Group 2 parameters are not on the same scale.

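The two-step procedure can be sketched in code. This is a minimal illustration and not one of the computer programs referred to in the text. It assumes a three-parameter logistic item response function with scaling constant 1.7 and solves for ability by bisection. The function names (`p3pl`, `true_score`, `solve_theta`, `equate`) and the item parameters are hypothetical.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic item response function (assumed model form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta, items):
    """Test characteristic function: sum of item response functions."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)

def solve_theta(score, items, lo=-10.0, hi=10.0, tol=1e-10):
    """Ability whose true score equals `score`, found by bisection.

    The true score is increasing in theta, so a sign change brackets the root.
    """
    if not (true_score(lo, items) < score < true_score(hi, items)):
        raise ValueError("score outside the attainable true-score range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if true_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def equate(xi, x_g1, w_g1, w_g2, y_g2):
    """Equate a Test X true score xi to Test Y through anchor test W."""
    theta1 = solve_theta(xi, x_g1)      # invert (1) on the Group 1 scale
    omega = true_score(theta1, w_g1)    # (2): anchor score for that ability
    theta2 = solve_theta(omega, w_g2)   # invert (3) on the Group 2 scale
    return true_score(theta2, y_g2)     # (4): equated Test Y score

# Made-up (a, b, c) triples; Group 2 uses a different origin and scale.
X = [(1.0, -0.5, 0.20), (0.8, 0.0, 0.25), (1.2, 0.7, 0.15)]
W1 = [(0.9, -0.2, 0.20), (1.1, 0.4, 0.20)]
A, B = 1.3, -0.4  # hypothetical linear relation between the two ability scales
W2 = [(a / A, A * b + B, c) for a, b, c in W1]
Y = [(a / A, A * b + B, c) for a, b, c in X]
print(equate(1.8, X, W1, W2, Y))
```

In this example the Group 2 parameters are the Group 1 parameters re-expressed on a different ability scale, and Test Y is a copy of Test X, so the equated score should reproduce the input score. With real forms the four parameter lists would come from the two separate calibrations.
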
An estimated equating is obtained from (1)-(4) after replacing
the true item parameters by their maximum likelihood estimates. Using
carets to denote this change, we have

$$\xi = \sum_{g} \hat P_{g1}(\hat\theta_1) \tag{1''}$$

$$\hat\omega = \sum_{g} \hat P_{g2}(\hat\theta_1) \tag{2''}$$

$$\hat\omega = \sum_{g} \hat P_{g3}(\hat\theta_2) \tag{3''}$$

$$\hat\eta = \sum_{g} \hat P_{g4}(\hat\theta_2) \tag{4''}$$

These equations show that $\hat\eta$ is a function of all the estimated item
parameters together with the specified value of $\xi$.

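Derivatives

For item $g$, instead of using $a_g$, $b_g$, and $c_g$ to denote
the three parameters commonly used in IRT, let us use $t_{1gp}$, $t_{2gp}$,
and $t_{3gp}$, respectively. We will need certain derivatives for
$r = 1, 2, 3$, obtained from (1'')-(4''):

$$\frac{\partial\hat\eta}{\partial\hat t_{rg4}} = \hat P^{(r)}_{g4}(\hat\theta_2) \tag{5}$$

$$\frac{\partial\hat\omega}{\partial\hat t_{rg3}} = \hat P^{(r)}_{g3}(\hat\theta_2), \qquad
\frac{\partial\hat\omega}{\partial\hat t_{rg2}} = \hat P^{(r)}_{g2}(\hat\theta_1)$$

where $\hat P^{(r)}_{gp}$ denotes the derivative of $\hat P_{gp}$ with respect to $\hat t_{rgp}$.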



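Similarly,

$$\frac{\partial\hat\eta}{\partial\hat\theta_2} = \sum_{g} \hat P'_{g4}(\hat\theta_2), \qquad
\frac{\partial\hat\omega}{\partial\hat\theta_1} = \sum_{g} \hat P'_{g2}(\hat\theta_1)$$

where $P'$ denotes a derivative with respect to $\theta$. Using the formula
for the derivative of an implicit function, we also find from (1'')-(4''),
for $r = 1, 2, 3$,

$$\frac{\partial\hat\theta_2}{\partial\hat t_{rg3}} = -\frac{\hat P^{(r)}_{g3}(\hat\theta_2)}{\sum_{h} \hat P'_{h3}(\hat\theta_2)}$$

$$\frac{\partial\hat\theta_1}{\partial\hat t_{rg1}} = -\frac{\hat P^{(r)}_{g1}(\hat\theta_1)}{\sum_{h} \hat P'_{h1}(\hat\theta_1)}$$

$$\frac{\partial\hat\theta_2}{\partial\hat\omega} = \frac{1}{\sum_{h} \hat P'_{h3}(\hat\theta_2)}$$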

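Using the chain rule for derivatives, we find from the above
formulas:

$$\frac{\partial\hat\eta}{\partial\hat t_{rg3}}
= \frac{\partial\hat\eta}{\partial\hat\theta_2}\,\frac{\partial\hat\theta_2}{\partial\hat t_{rg3}}
= -\hat P^{(r)}_{g3}(\hat\theta_2)\,\frac{\sum_{h} \hat P'_{h4}(\hat\theta_2)}{\sum_{h} \hat P'_{h3}(\hat\theta_2)} \tag{6}$$

$$\frac{\partial\hat\eta}{\partial\hat t_{rg2}}
= \frac{\partial\hat\eta}{\partial\hat\theta_2}\,\frac{\partial\hat\theta_2}{\partial\hat\omega}\,\frac{\partial\hat\omega}{\partial\hat t_{rg2}}
= \hat P^{(r)}_{g2}(\hat\theta_1)\,\frac{\sum_{h} \hat P'_{h4}(\hat\theta_2)}{\sum_{h} \hat P'_{h3}(\hat\theta_2)} \tag{7}$$

$$\frac{\partial\hat\eta}{\partial\hat t_{rg1}}
= \frac{\partial\hat\eta}{\partial\hat\theta_2}\,\frac{\partial\hat\theta_2}{\partial\hat\omega}\,\frac{\partial\hat\omega}{\partial\hat\theta_1}\,\frac{\partial\hat\theta_1}{\partial\hat t_{rg1}}
= -\hat P^{(r)}_{g1}(\hat\theta_1)\,\frac{\sum_{h} \hat P'_{h2}(\hat\theta_1)}{\sum_{h} \hat P'_{h1}(\hat\theta_1)}\,\frac{\sum_{h} \hat P'_{h4}(\hat\theta_2)}{\sum_{h} \hat P'_{h3}(\hat\theta_2)} \tag{8}$$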
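
Formulas (5)-(8) can be evaluated mechanically once the item response functions and their derivatives are available. The sketch below is an illustration of the chain-rule bookkeeping only. It assumes the three-parameter logistic model with scaling constant 1.7 and orders each item's parameters as (a, b, c) for r = 1, 2, 3. The function names `irf`, `grads`, and `eta_derivatives` are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant

def irf(theta, a, b, c):
    """Three-parameter logistic item response function (assumed model form)."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def grads(theta, a, b, c):
    """Return ([dP/da, dP/db, dP/dc], dP/dtheta) for one item."""
    L = 1 / (1 + math.exp(-D * a * (theta - b)))
    w = (1 - c) * L * (1 - L) * D
    return [w * (theta - b), -w * a, 1 - L], w * a

def eta_derivatives(theta1, theta2, tests):
    """Derivatives of the equated score with respect to every item parameter.

    tests = [X in Group 1, W in Group 1, W in Group 2, Y in Group 2],
    each a list of (a, b, c) triples, matching p = 1, 2, 3, 4 in the text.
    Returns d with d[p][g][r], following (8), (7), (6), (5) for p = 0..3.
    """
    thetas = [theta1, theta1, theta2, theta2]
    g = [[grads(th, *item) for item in test] for th, test in zip(thetas, tests)]
    slope = [sum(dth for _, dth in gp) for gp in g]  # sum over items of P'
    k43 = slope[3] / slope[2]                        # ratio appearing in (6)-(8)
    factor = [-(slope[1] / slope[0]) * k43,          # (8): Test X, Group 1
              k43,                                   # (7): Test W, Group 1
              -k43,                                  # (6): Test W, Group 2
              1.0]                                   # (5): Test Y, Group 2
    return [[[factor[p] * d for d in dt] for dt, _ in g[p]] for p in range(4)]
```

The abilities `theta1` and `theta2` are the values implied by the chosen score through (1'') and (3''); they would be supplied by whatever routine performs the equating itself.
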



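Given $\xi$, we are now in a position to express $\hat\eta$ as a series in
powers of $\hat t_{rgp} - t_{rgp}$ ($r = 1, 2, 3$; $g = 1, 2, \ldots, n_p$; $p = 1, 2, 3, 4$).
We will write $\eta'_{rgp}$ instead of $\partial\eta/\partial t_{rgp}$ and $\eta''_{rgpshq}$ instead of
$\partial^2\eta/\partial t_{rgp}\,\partial t_{shq}$:

$$\hat\eta = \eta + \sum_{p}\sum_{g}\sum_{r} (\hat t_{rgp} - t_{rgp})\,\eta'_{rgp}
+ \frac{1}{2}\sum_{p}\sum_{q}\sum_{g}\sum_{h}\sum_{r}\sum_{s}
(\hat t_{rgp} - t_{rgp})(\hat t_{shq} - t_{shq})\,\eta''_{rgpshq} + \cdots \tag{9}$$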



Sampling Variance 
Transposing, squaring, and taking expectations, we find from
(9) for fixed $\xi$,

$$\operatorname{Var}\hat\eta \equiv \mathcal{E}(\hat\eta - \eta)^2
= \sum_{p}\sum_{q}\sum_{g}\sum_{h}\sum_{r}\sum_{s} \eta'_{rgp}\,\eta'_{shq}\,\operatorname{Cov}(\hat t_{rgp}, \hat t_{shq}) + \cdots$$

When item parameters and abilities are both estimated simultaneously
by maximum likelihood, it is not practical to use the usual sampling
covariance formulas for all estimators simultaneously. As a rough
approximation, it is customary (Lord, 1980, Section 12.3) to use instead
the (simpler) formulas for the case where the ability parameters are known.
We will use this rough approximation here to find $\operatorname{Cov}(\hat t_{rgp}, \hat t_{shq})$.
Because of this approximation, our sampling variance of equating
will be an underestimate.

In this case, all covariances involving two different items are
exactly zero, as are all covariances involving a single item administered
to two different groups of examinees. All nonzero variances and
covariances are inversely proportional to $N$, the number of examinees.
We now have

$$\operatorname{Var}\hat\eta = \sum_{p}\sum_{g}\Bigl[\,\sum_{r=1}^{3}\sum_{s=1}^{3}\bigl\{\eta'_{rgp}\,\eta'_{sgp}\,\operatorname{Cov}(\hat t_{rgp}, \hat t_{sgp})\bigr\}
+ \sum^{3}\sum^{3}\sum^{3}\{\cdots\} + \sum^{3}\sum^{3}\sum^{3}\sum^{3}\{\cdots\} + \cdots\Bigr]$$

Some higher order terms are indicated here in order to make clear that
the number of terms under summation signs does not increase too rapidly.
The triple summation represents 3 times as many terms as the double
summation, but each term in the triple summation is divided by $N^{3/2}$
whereas each term in the double summation is only divided by $N$. When
$N$ is several thousand, it is reasonable to expect that the higher
order terms can be neglected, as is customary with asymptotic variances.

Our final asymptotic formula, then, is

$$\operatorname{Var}\hat\eta = \sum_{p=1}^{4}\sum_{g=1}^{n_p}\sum_{r=1}^{3}\sum_{s=1}^{3} \eta'_{rgp}\,\eta'_{sgp}\,\operatorname{Cov}(\hat t_{rgp}, \hat t_{sgp}) \tag{10}$$

The $\eta'$ values required here are computed from (5)-(8). The
covariances are obtained by the usual formulas for covariances of maximum
likelihood estimators of item parameters when ability parameters are
fixed (Lord, 1980, p. 191).
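
Formula (10) is a sum of small quadratic forms, one per item in each of the four calibrations. A minimal sketch of that summation follows. The function name `var_eta`, the nested-list layout, and the numbers in the example are illustrative assumptions; real inputs would be the derivatives from (5)-(8) and the 3 x 3 covariance matrices of each item's parameter estimates.

```python
import math

def var_eta(eta_prime, cov):
    """Asymptotic sampling variance of the equated score, as in (10).

    eta_prime[p][g] is the length-3 list of derivatives for item g in
    calibration p; cov[p][g] is the matching 3x3 covariance matrix.
    Covariances across items and across groups are taken to be zero,
    as in the text, so only within-item terms are summed.
    """
    total = 0.0
    for p in range(len(eta_prime)):
        for g in range(len(eta_prime[p])):
            d, c = eta_prime[p][g], cov[p][g]
            for r in range(3):
                for s in range(3):
                    total += d[r] * d[s] * c[r][s]
    return total

# Made-up example: one item in each of the four calibrations.
diag = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
corr = [[0.01, 0.005, 0.0], [0.005, 0.01, 0.0], [0.0, 0.0, 0.01]]
eta_prime = [[[1.0, 0.0, 0.0]], [[0.0, 2.0, 0.0]], [[0.0, 0.0, 0.0]], [[1.0, 1.0, 0.0]]]
cov = [[diag], [diag], [diag], [corr]]
standard_error = math.sqrt(var_eta(eta_prime, cov))
```

The standard error of equating at the chosen score is the square root of the returned variance.
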



Practical Application

Without data, it is difficult to make inferences about the magnitude
of the sampling errors in IRT equating. Will they be larger or smaller
than the sampling errors in conventional linear equating? In conventional
equipercentile equating? Do sampling errors become large or small at
extreme score levels?

Equation (10) has been applied to an equating of the Verbal score on
the 90-item Form VSA4 of the Scholastic Aptitude Test (12/73 administration)
to the 85-item Form XSA2 Verbal score (4/75 administration). All examinees
took an SAT and also a 40-item anchor test. Petersen, Cook, and Stocking
(1980) made separate LOGIST runs on the 130 items in the 1973 administration
for a sample of 2665 examinees, and on the 125 items in the 1975
administration for a sample of 2686 examinees. They have allowed the
use here of their item parameter estimates.

SAT scaled scores are a linear transformation of formula scores (rights
minus one-quarter wrongs). Our results here are for the hypothetical
case where all examinees answer all items. In this special case formula
scores are a linear transformation of number-right scores, so scaled
scores are likewise. Since a known linear transformation $A\eta + B$
of number-right scores $\eta$ simply multiplies the standard error of $\hat\eta$
by the constant $A$, it is not difficult to obtain scaled-score
standard errors from (10). A computer program to do this was written
and run by Marilyn Wingersky.
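
The arithmetic behind this rescaling is short enough to show directly. The sketch below is illustrative only: the function names are hypothetical, the input standard error of 2.0 is made up, and the slope of 7.0 for the formula-to-scaled conversion is a placeholder, since the actual SAT conversion constants are not given here.

```python
def formula_score(rights, n_items, wrong_penalty=0.25):
    """Formula score (rights minus one-quarter wrongs) when every item is answered."""
    wrongs = n_items - rights
    return rights - wrong_penalty * wrongs

def se_after_linear(se, slope):
    """Standard error of slope * score + intercept; the intercept drops out."""
    return abs(slope) * se

# With all 85 items answered, formula score = 1.25 * rights - 21.25,
# so a number-right standard error is multiplied by 1.25.
se_formula = se_after_linear(2.0, 1.25)       # 2.0 is a made-up input value
# A further formula-to-scaled conversion with a placeholder slope of 7.0:
se_scaled = se_after_linear(se_formula, 7.0)
```

Because both conversions are linear, the two slopes simply multiply, and neither intercept affects the standard error.
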

For each of certain specified formula scores on XSA2, Table 1 shows
1) the equivalent scaled score found by the conventional linear procedure
usually used for the SAT (Design IV A, Angoff, 1971), 2) the standard error
of these equated (scaled) scores as found by the computer program AUTEST
(Lord, 1975) assuming the validity of the linear model; also 3) the
equivalent scaled score found by the IRT method of this report, and 4) the
corresponding scaled-score standard error calculated from (10). The
standard errors in Table 1 are best understood in comparison with the
standard deviation of scaled scores, which is 106 for XSA2; and in
comparison with the classical test model standard error of measurement
(due to imperfect test reliability), which is 31. Clearly the standard
error of equating is small compared to the standard error of measurement.

Judging by the IRT standard errors, the equating is definitely
nonlinear, at least outside the score range from 350 to 650. The
IRT standard errors show a continued sharp increase as the minimum
possible true formula score of -5.5 is approached. At the other end of the
score scale, the IRT standard error increases up to a scaled score of 760
and decreases thereafter. The reason for the decrease at the upper end is
that for a perfect score, the standard error of this kind of IRT equating
is zero. Except at the upper end, the IRT standard error is larger than
the linear.

Table 1

A Comparison of Linear and IRT Equatings and of Their Standard Errors

Selected          Linear Model                IRT Model
formula       Equivalent                  Equivalent
scores*,      scaled        Standard      scaled        Standard
XSA2          score         error         score         error

 84           780           [illegible]   813.8          2.3
 79.74        750           4.2           778.0          4.5
 72.70        700           3.6           717.6          [illegible]
 65.65        650           3.1           658.8          3.6
 58.61        600           2.5           602.4          2.8
 51.57        550           2.1           548.0          2.2
 44.52        500           1.7           495.4          2.0
 37.48        450           1.5           448.7          2.1
 30.43        400           1.6           399.3          2.3
 23.39        350           1.8           355.6          2.8
 16.35        300           2.3           313.3          3.6
  9.30        250           2.8           270.2          4.7
  2.26        200           3.3           223.0          7.0
 -5           150           3.9           163.5         15.6

*Although formula score is actually a discrete variable, it
is for convenience treated here as continuous.

The results of Table 1 are displayed in Figures 1-2. The straight
line in Figure 1 shows the linear equating of true formula score on
XSA2 to true scaled score on VSA4. The dashed lines are drawn two
standard errors above and below the straight line.

Figure 2 similarly displays the curvilinear IRT equating of XSA2
to VSA4 and its standard error. The straight-line extension of the lower
end of the equating (middle) line in Figure 2 was obtained by the method
described in Lord (1980, pp. 210-211). It is shown in the figure for
completeness, but no standard error is shown since there is no good
theoretical basis for such an extension.

Table 2 compares the present IRT equating with a conventional equipercentile
equating of XSA2 to VSA4 via the anchor test. In conventional equating,
an XSA2 score and a VSA4 score that are each equipercentile-equivalent
to a given anchor test score are taken to be equivalent to each other.
The standard error of the resulting equipercentile equating of XSA2
to VSA4 is given by $\sqrt{\mathrm{SE}^2_{XSA2} + \mathrm{SE}^2_{VSA4}}$, where the SE under the radical
sign are standard errors of separate equipercentile equatings of each
test to the anchor test. Formulas for $\mathrm{SE}_{XSA2}$ and $\mathrm{SE}_{VSA4}$ are given
in Lord (1981).
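
As a small worked illustration of this root-sum-of-squares rule, the two separate standard errors combine as follows. The function name and the input values are hypothetical; the formulas for the two component standard errors themselves are those of Lord (1981) and are not reproduced here.

```python
import math

def chained_equipercentile_se(se_x_to_anchor, se_y_to_anchor):
    """Combine the standard errors of two separate equatings to the anchor test."""
    return math.sqrt(se_x_to_anchor ** 2 + se_y_to_anchor ** 2)

# Made-up component standard errors, in scaled-score units.
combined = chained_equipercentile_se(3.0, 4.0)
```

The combined value is never smaller than either component, which is one reason an equating that passes through the anchor test twice can carry a larger standard error.
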

[Figure: horizontal axis labeled TRUE FORMULA SCORE, running from -30 to 80; the remaining plot content is not recoverable.]



Table 2

A Comparison of Equipercentile and IRT Equating
and of Their Standard Errors

XSA2          Equipercentile Method           IRT Model
formula       Equivalent                  Equivalent
score         scaled        Standard      scaled        Standard
              score         error         score         error

 78.1         774           13.47         764            4.68
 70.6         722           15.85         700            4.18
 64.75        652           10.32         65[?]          3.44
 58.9         602            4.97         605            2.78
 52.9         558            4.12         558            2.32
 47.25        514            3.47         515            2.09
 40.1         466            3.44         464            2.05
 32.4         417            2.93         412            2.24
 25.75        364            3.37         370            2.63
 16.1         314            4.07         312            3.62
  7.6         242            5.70         259            5.08
 -3.75        195            7.85         175           12.49
Since $\mathrm{SE}_{XSA2}$ and $\mathrm{SE}_{VSA4}$ are estimated from unsmoothed data,
the equipercentile standard errors in Table 2 fluctuate somewhat.
Nevertheless, it is apparent that the equipercentile method has a much
larger standard error above a scaled score of 450. For these data, the
IRT method shows a larger standard error than the equipercentile method
only when the formula score is negative.

The standard error of equipercentile equating could be reduced by
smoothing the frequency distribution of raw scores before equating.
Smoothing is undoubtedly desirable as a practical expedient; however, the
choice of a smoothing formula is somewhat arbitrary and the smoothing is
likely to prevent convergence of the estimated equating to its true value
in large samples. Formulas for the standard errors of smoothed equipercentile
equating are not presently available.

In order to determine the effect of using a shorter anchor test,
every other item in the anchor test was discarded and the data
reanalyzed on the basis of the remaining 20-item anchor test. The
effect on the standard errors of IRT equating is shown in Table 3.
The two equatings agree fairly well. At the point where the equating
standard errors are a minimum, halving the length of the anchor test
increases the standard error by a factor of about [illegible]. At the other
score points, the effect is less. Given standard errors like those in
Table 2, it will now be possible to make a reasonable judgment as to the
length necessary for an anchor test.

Table 3

IRT Equatings and Their Scaled-Score Standard Errors,
a Comparison of Results Using 20- and 40-Item Anchor Tests

                        Length of Anchor Test
XSA2               20 Items                  40 Items
formula       Scaled        Standard      Scaled        Standard
score         score         error         score         error

 80           787           5.9           780           4.5
 70           698           5.3           695           [illegible]
 60           615           3.9           613           2.9
 50           540           3.0           536           2.2
 40           467           2.7           463           2.0
 30           399           3.0           397           2.4
 20           336           3.9           335           3.2
 10